Skip to content

Conversation

@GaneshPatil7517
Copy link
Contributor

Which issue does this PR close?

Closes #19789

Rationale for this change

When executing a partitioned hash join with a wide build table (many columns), the current PartitionMode::Partitioned approach adds a RepartitionExec that copies all columns of the build table via take_arrays. This is expensive when only the join key columns are needed for partitioning.

What changes are included in this PR?

This PR introduces a new PartitionMode::LazyPartitioned that avoids the full build-side RepartitionExec. Instead:

  • Build side: Requests UnspecifiedDistribution (no repartitioning). All partitions are merged via CoalescePartitionsExec.
  • Probe side: Still requests HashPartitioned distribution (repartitioned as before).
  • Lazy filtering: During hash table construction, rows are filtered using hash(join_keys) % partition_count == current_partition, ensuring each partition only builds its relevant subset.

Key changes:

File Change
joins/mod.rs Added LazyPartitioned variant to PartitionMode enum
joins/hash_join/exec.rs Added PartitionFilter struct and filter_batch_by_partition() function; updated required_input_distribution() and execute()
joins/hash_join/stream.rs Updated pattern matches for new mode
joins/hash_join/shared_bounds.rs Updated SharedBuildAccumulator handling
physical-optimizer/enforce_distribution.rs Added LazyPartitioned to key reordering logic
physical-optimizer/join_selection.rs Added swap handling for new mode
proto/datafusion.proto Added LAZY_PARTITIONED = 3
proto/src/physical_plan/mod.rs Added serialization/deserialization

Are these changes tested?

Yes. All existing tests pass:

  • ✅ 765 hash join tests
  • ✅ 16 physical optimizer tests
  • ✅ 2 proto roundtrip tests (including roundtrip_hash_join)

Are there any user-facing changes?

No breaking changes. This adds a new PartitionMode::LazyPartitioned option that can be explicitly selected for hash joins where the build table is wide but the join key is narrow. The existing Partitioned, CollectLeft, and Auto modes remain unchanged.

Performance Impact

For wide build tables, LazyPartitioned avoids copying non-join-key columns during repartitioning, reducing memory allocations and CPU overhead. The trade-off is that each partition now scans all build rows (but only retains those matching its partition).

@github-actions github-actions bot added optimizer Optimizer rules proto Related to proto crate physical-plan Changes to the physical-plan crate labels Jan 14, 2026
@GaneshPatil7517 GaneshPatil7517 force-pushed the feature/lazy-partitioned-hash-join branch from e361a74 to 420e207 Compare January 14, 2026 11:46
@Dandandan
Copy link
Contributor

run benchmark tpch tpcds

PartitionMode::LazyPartitioned => {
// LazyPartitioned mode: build side is NOT repartitioned (we read all
// partitions and filter locally), but probe side IS hash-partitioned.
let right_expr = self.on.iter().map(|(_, r)| Arc::clone(r)).collect();
Copy link
Contributor

@Dandandan Dandandan Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • we can probably implement it for both sides (vec![ Distribution::UnspecifiedDistribution, Distribution::UnspecifiedDistribution, ] to also save the RepartitionExec on the right side
  • I think we should strive to replace PartitionMode::Partitioned with the new implementation (just a faster version).
  • We should compute the hashes and indices for each partition only once instead of for each partition (to avoid making many partitions slow).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok ill update it.....

Copy link
Contributor

@Dandandan Dandandan Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is even better to start with (only avoid repartitioning the probe side, I meant this originally but put build side in the description):

vec![ Distribution::Hash(left_expr), Distribution::UnspecifiedDistribution, ]

as it probably will be needed any way to redistribute/prepare the left side and the probe side will be often the larger one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok sir ill update that also.....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I created an (AI-based) variant on the idea in #19812 and put some bench results in the description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing this. I’ll review the AI-based variant and the benchmarks you added and see how it compares with this approach.

…c overhead

This commit adds a new PartitionMode::LazyPartitioned that avoids the
full build-side RepartitionExec when executing partitioned hash joins.
Instead of pre-repartitioning all columns of the build table, rows are
filtered lazily during hash table construction using hash(join_keys) %
partition_count.

Key changes:
- Add LazyPartitioned variant to PartitionMode enum
- Build side requests UnspecifiedDistribution (merged, no repartition)
- Probe side still requests HashPartitioned distribution
- Add filter_batch_by_partition() to filter build rows per partition
- Update collect_left_input to accept optional partition filter
- Add protobuf serialization support for new mode
- Update optimizer to handle LazyPartitioned in key reordering

This optimization is beneficial for wide build tables where copying
all columns in RepartitionExec is expensive.

Closes apache#19789
@GaneshPatil7517 GaneshPatil7517 force-pushed the feature/lazy-partitioned-hash-join branch from 08e8ef5 to 6d62650 Compare January 14, 2026 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules physical-plan Changes to the physical-plan crate proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce overhead of Partitioned hash join

2 participants